Biostatistics For Dummies (Monika Wahi John Pezzullo)

scale examples we describe in the earlier section “Looking at Levels of Measurement”), data storage

gets even more interesting. First, you have to ask yourself, “Is this variable a Choose only one or

Choose all that apply variable?” The coding is completely different for these two kinds of multiple-

choice variables.

You handle the Choose only one situation just as we describe for Type of Caregiver in the preceding

section: you establish a numeric code for each alternative. For the Likert scale example, if the item

asked about patient satisfaction, you could have a categorical variable called PatSat, with five

possible values: 1 for strongly disagree, 2 for somewhat disagree, 3 for neither agree nor disagree, 4

for somewhat agree, and 5 for strongly agree. And for the Type of Caregiver example, if only one kind

of caregiver is allowed to be chosen from the three choices of nurse, physician, or social worker, you

can have a categorical variable called CaregiverType with three possible values: 1 for nurse, 2 for

physician, and 3 for social worker. Depending upon the study, you may also choose to add a 4 for

other, and a 9 for unknown (9, 99, and 999 are codes conventionally reserved for unknown). If you find

unexpected values, it is important to research and document what these mean to help future analysts

encountering the same data.
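
As a rough illustration of the Choose only one coding just described, the sketch below keeps the CaregiverType codes in a small codebook and lists any recorded value that the codebook does not define. The codes 1 through 4 and 9 come from the text; the names CAREGIVER_CODES and find_unexpected, and the sample values, are made up for this example.

```python
# Codebook for the single-choice CaregiverType variable.
# Codes 1-3, 4 (other), and 9 (unknown) follow the text; the names
# CAREGIVER_CODES and find_unexpected are hypothetical.
CAREGIVER_CODES = {
    1: "nurse",
    2: "physician",
    3: "social worker",
    4: "other",
    9: "unknown",
}


def find_unexpected(values, codebook):
    """Return the sorted distinct values that the codebook does not define."""
    return sorted(set(values) - set(codebook))


# Made-up column of recorded codes; 7 is not a defined code.
recorded = [1, 2, 2, 3, 9, 7, 4]

unexpected = find_unexpected(recorded, CAREGIVER_CODES)

# Translate codes to labels, marking anything undefined so it stands out.
labels = [CAREGIVER_CODES.get(value, "UNDOCUMENTED") for value in recorded]

print(unexpected)
print(labels)
```

Any value this check reports is a candidate for the research and documentation step mentioned above.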

But the situation is quite different if the variable is Choose all that apply. For the Type of Caregiver

example, if the patient is being served by a team of caregivers, you have to set up your database

differently. Define separate variables in the database (separate columns in Excel) — one for each

possible category value. Imagine that you have three variables called Nurse, Physician, and SW (the

SW stands for social worker). Each variable is a two-value category, also known as a two-state flag,

and is populated as 1 for having the attribute and 0 for not having the attribute. So, if participant 101’s

care team includes only a physician, participant 102’s care team includes a nurse and a physician, and

participant 103’s care team includes a social worker and a physician, the information can be coded as

shown in the following table.

Subject  Nurse  Physician  SW
101      0      1          0
102      1      1          0
103      0      1          1
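
The two-state flag coding for a Choose all that apply variable can be sketched in a few lines of Python. This is only an illustration: the column names Nurse, Physician, and SW and the three care teams come from the text, while care_teams, FLAG_COLUMNS, and to_flags are hypothetical names.

```python
# Care teams for participants 101-103, as described in the text.
care_teams = {
    101: {"Physician"},
    102: {"Nurse", "Physician"},
    103: {"SW", "Physician"},
}

# One flag column per possible category value.
FLAG_COLUMNS = ["Nurse", "Physician", "SW"]


def to_flags(team):
    """Code each category as 1 if it is on the team and 0 if it is not."""
    return {column: 1 if column in team else 0 for column in FLAG_COLUMNS}


# Build one row per participant, with the subject ID plus the three flags.
rows = [{"Subject": subject, **to_flags(team)}
        for subject, team in care_teams.items()]

for row in rows:
    print(row)
```

Each participant gets exactly one row, and every flag cell holds either 1 or 0, so no cell is left blank.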

If you have variables with more than two categories, missing values theoretically can be indicated by

leaving the cell blank, but blanks are difficult to analyze in statistical software. Instead, categories

should be set up for missing values so they can be part of the coding system (such as using a numerical

code to indicate unknown, refused, or not applicable). The goal is to make sure that for every

categorical variable, a numerical code is entered and the cell is not left blank.
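
One way to check this rule before analysis is to scan a categorical column for cells that were left blank. The sketch below assumes the data is held as a list of dictionaries and that a blank cell shows up as None or an empty string. The sample records and the name blank_cells are hypothetical.

```python
# Made-up records; None or "" stands for a cell that was left blank.
records = [
    {"Subject": 101, "CaregiverType": 2},
    {"Subject": 102, "CaregiverType": None},
    {"Subject": 103, "CaregiverType": 9},  # 9 = unknown, an explicit code
    {"Subject": 104, "CaregiverType": ""},
]


def blank_cells(rows, column):
    """Return the subjects whose value in `column` was left blank."""
    return [row["Subject"] for row in rows if row.get(column) in (None, "")]


blanks = blank_cells(records, "CaregiverType")
print(blanks)
```

Subject 103 is not reported because an explicit unknown code was entered. Only the truly blank cells are listed for follow-up.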

Never try to cram multiple choices into one column! For example, don’t enter 1, 2 into a cell

in the CaregiverType column to indicate the patient has a nurse and physician. If you do, you have

to painstakingly split your single multi-valued column into separate two-state flag columns

(described earlier) before you analyze the data. Why not do it right the first time?
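
To show what that cleanup involves, here is a minimal sketch of splitting a crammed cell such as "1, 2" into separate two-state flags. It assumes the codes in the cell are separated by commas and use the CaregiverType codes from the text (1 for nurse, 2 for physician, 3 for social worker). The names CODE_TO_FLAG and split_multi_valued are hypothetical.

```python
# Map each CaregiverType code to its flag column name.
CODE_TO_FLAG = {1: "Nurse", 2: "Physician", 3: "SW"}


def split_multi_valued(cell):
    """Turn a comma-separated cell like "1, 2" into 0/1 flags per category."""
    # Ignore empty pieces; int() tolerates the spaces around each code.
    codes = {int(part) for part in cell.split(",") if part.strip()}
    return {flag: 1 if code in codes else 0
            for code, flag in CODE_TO_FLAG.items()}


flags = split_multi_valued("1, 2")
print(flags)
```

Real data would also need handling for typos, other separators, and undefined codes, which is part of why entering the flags directly is less work.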

Recording numerical data